Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS #796
Robby955 wants to merge 4 commits into openai:main from
Conversation
3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003) 8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB. Key innovation: distributed cache pre-fill using pure numpy. Each GPU rank pre-populates n-gram hash tables with ALL preceding token positions before scoring, producing results mathematically identical to single-GPU sequential evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
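The pre-fill idea described in this commit can be sketched in pure numpy. This is an illustrative reconstruction, not the PR's actual code: the function name `prefill_ngram_cache` and the dict-of-dicts table layout are assumptions. The point it shows is that a rank scoring a shard first accumulates counts from ALL preceding positions, so its cache matches what a sequential single-GPU pass would have built.

```python
import numpy as np

def prefill_ngram_cache(tokens, start, order, cache=None):
    """Pre-populate an n-gram count table with every context that ends
    before position `start`, so a rank scoring positions [start, end)
    sees exactly the counts a sequential single-GPU pass would have
    accumulated by that point. Hypothetical sketch of the approach."""
    cache = cache if cache is not None else {}
    for pos in range(order, start):
        ctx = tokens[pos - order:pos].tobytes()  # hashable context key
        nxt = int(tokens[pos])
        slot = cache.setdefault(ctx, {})
        slot[nxt] = slot.get(nxt, 0) + 1
    return cache

# Each of the 8 ranks pre-fills with everything before its own shard:
tokens = np.array([1, 2, 3, 1, 2, 4, 1, 2, 3, 1, 2], dtype=np.int64)
cache = prefill_ngram_cache(tokens, start=8, order=2)
# context (1, 2) was followed by 3 once and by 4 once among positions < 8
```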
…ptive gating 3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Up from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).
nice 🔥🔥🔥🔥
Add complementary training (from @pentxayc #803) and per-order multipliers (from @AayushBaniya2006 #809) on top of distributed prefill + 15-gram + order-adaptive gating. New 3-seed results: 0.28798 / 0.28804 / 0.28810 All seeds under 16MB, training under 560s, eval under 330s. Updated README with legality hedge, full ablation, credits.
CRITICAL FIX: Previously, each of the 8 GPU ranks updated its n-gram cache with only its own 1/8 of the scored windows. Now ALL ranks update with the FULL chunk (same as the mixer already does). PR openai#796 showed this costs ~0.31 BPB: "Without pre-fill, ranks 1-7 start with empty n-gram caches. This costs ~0.31 BPP." Expected: massive improvement from 8x more n-gram data per rank. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full-chunk n-gram cache sharing: 0.6913 -> 0.5865 (-0.105 BPB) This confirms PR openai#796's finding that rank-local caches lose ~0.1+ BPB. WARNING: artifact=16.25MB (over 16MB limit for this seed). Need to increase pruning from 3% to 4%, or reduce bigram_vocab_size, to ensure all seeds fit. Eval time: 492s (within budget). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ipliers Novel improvement over a uniform entropy threshold:
- Per-order entropy center: order 2 → 5.0 (trust only when the model is confused), order max → 2.0 (trust even when the model is OK)
- Per-order alpha multiplier: order 2 → 0.3× (suppress noise), order max → 2.0× (boost precision)
- Linear interpolation between orders for a smooth transition
Inspired by PR openai#796's ablation showing -0.182 BPB from order-adaptive gating alone. Our implementation is continuous (sigmoid per order) rather than discrete thresholds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
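A minimal sketch of the per-order scheme this commit describes: the entropy center and alpha multiplier are linearly interpolated between order 2 and the max order, and a sigmoid of the neural model's entropy gates each order's contribution. All function and parameter names here (including the `sharpness` knob) are assumptions for illustration, not the submission's actual code.

```python
import numpy as np

def order_adaptive_gate(entropy, order, max_order=15,
                        center_lo=5.0, center_hi=2.0,
                        alpha_lo=0.3, alpha_hi=2.0, sharpness=1.0):
    """Continuous order-adaptive gate (illustrative sketch). Low orders
    fire only when the neural model is confused (high entropy center,
    small alpha); the max order fires even when the model is confident
    (low center, large alpha)."""
    t = (order - 2) / (max_order - 2)             # 0 at order 2, 1 at max
    center = (1 - t) * center_lo + t * center_hi  # interpolated entropy center
    alpha = (1 - t) * alpha_lo + t * alpha_hi     # interpolated multiplier
    gate = 1.0 / (1.0 + np.exp(-sharpness * (entropy - center)))
    return alpha * gate                           # effective mixing weight

# At 3.5 nats, order 2 is mostly suppressed while order 15 is trusted:
w2 = order_adaptive_gate(3.5, order=2)    # ~0.05
w15 = order_adaptive_gate(3.5, order=15)  # ~1.64
```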
… validated) Replace per-order multipliers with recursive Dirichlet posterior predictive. Neural model as informative prior, single concentration c=5.0. 3-seed mean: 0.22923 BPB (std 0.000005). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated submission: 0.6567 → 0.2880 → 0.2292 BPB (3-seed mean, std 0.000005). Replaced per-order multipliers with Dirichlet-Multinomial posterior smoothing (single concentration c=5.0). All logs, code, and submission.json updated in the latest commit.
This is one of the cleanest submissions in the competition. Replacing 14 hand-tuned per-order alpha parameters with a single Dirichlet concentration (c=5.0) is elegant — the recursive posterior predictive naturally handles sparsity at high orders without any manual intervention. The math does what entropy thresholds and sigmoid gating are trying to approximate. The 3-seed std of 0.000005 is also remarkable — tightest we've seen across all submissions. Nice work.
Superseded by neural-track work. |
Record: Empirical Bayes N-gram Mixing -- val_bpb=0.2292
What this does
Instead of hand-tuning alpha multipliers for each n-gram order (my previous submission at 0.2880), I replaced the mixing strategy with Bayesian posterior inference.
The formula:

$$p_n(w \mid h) \;=\; \frac{\mathrm{count}_n(h, w) \,+\, c \, p_{n-1}(w \mid h)}{\mathrm{count}_n(h) \,+\, c}$$

This is the Dirichlet-Multinomial posterior predictive: the neural model is the prior, the n-gram counts are the likelihood, and the concentration c controls the tradeoff. It is applied recursively from the bigram up to the 15-gram, with each order's smoothed estimate becoming the next order's prior (the base of the recursion is the neural model's distribution). A single global concentration (c=5.0) handles the sparse-count problem that previously required hand-tuned per-order multipliers. The improvement is 0.059 BPB, which I didn't expect from replacing 14 tuned parameters with one.
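A minimal numpy sketch of this recursive posterior predictive. The dense per-vocab count arrays are a simplifying assumption for illustration; the submission itself stores counts in hashed n-gram tables.

```python
import numpy as np

def dirichlet_backoff(neural_probs, counts_by_order, c=5.0):
    """Recursive Dirichlet-Multinomial posterior predictive (sketch).
    neural_probs: the neural LM's distribution over the vocab at the
    current position (the prior at the bottom of the recursion).
    counts_by_order: next-token count arrays under each order's context,
    lowest order first. Each order's smoothed estimate becomes the next
    order's prior."""
    p = np.asarray(neural_probs, dtype=np.float64)
    for counts in counts_by_order:           # order 2 ... order 15
        total = counts.sum()
        p = (counts + c * p) / (total + c)   # posterior predictive update
    return p

# Tiny example: vocab of 3, a single order with counts [4, 0, 0]
prior = np.array([0.2, 0.5, 0.3])
post = dirichlet_backoff(prior, [np.array([4.0, 0.0, 0.0])], c=5.0)
# post = ([4,0,0] + 5*[0.2,0.5,0.3]) / 9 = [5/9, 2.5/9, 1.5/9]
```

Note how the single concentration does the gating automatically: when an order's context is unseen (total = 0), the update returns the prior unchanged, and as counts grow the estimate smoothly shifts toward the empirical distribution.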
Results

3-seed validated (seeds 1337 / 2024 / 2025): mean 0.2292 BPB, std 0.000005. Artifact ~14.9 MB, training under 560s and eval under 330s on 8xH100, all within budget.
Ablation chain

- 0.6567: distributed prefill + 15-gram + order-adaptive gating (previous submission)
- 0.2880: adds full-chunk n-gram cache sharing, complementary training, and per-order multipliers
- 0.2292: replaces per-order multipliers with recursive Dirichlet-Multinomial smoothing (-0.059 BPB)
What's novel
Using a neural LM as the base measure in hierarchical Bayesian n-gram smoothing. Traditional Bayesian LMs (MacKay & Peto 1995, Teh 2006) use uniform or unigram priors. This is the Dirichlet special case (discount=0) of the Pitman-Yor family, a sibling to Kneser-Ney, not a generalization of it.
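To make the relationship concrete, here is the standard Pitman-Yor posterior predictive with discount $d$, concentration $\theta$, and table counts $t$ (textbook form, written in this document's notation):

$$p_{\mathrm{PY}}(w \mid h) \;=\; \frac{\mathrm{count}(h,w) \,-\, d\,t_{hw} \,+\, \big(\theta + d\,t_{h\cdot}\big)\, p_{\mathrm{base}}(w \mid h)}{\mathrm{count}(h) + \theta}$$

Setting $d = 0$ collapses this to $\big(\mathrm{count}(h,w) + \theta\, p_{\mathrm{base}}(w \mid h)\big) / \big(\mathrm{count}(h) + \theta\big)$, the Dirichlet form used here with $\theta = c$, while Kneser-Ney-style smoothing corresponds to the discounted case $d > 0$ — hence sibling, not generalization.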
What's borrowed
N-gram cache approach from the community (especially @deanbrr, @lukacf, @Asukabot0, @newjordan). Complementary training from @pentxayc. Per-order multiplier concept from @AayushBaniya2006 (now replaced by Dirichlet). The Bayesian smoothing formula itself is textbook.
Compliance

All 3 seeds under the 16 MB artifact limit, training under 560s, eval under 330s; validated on seeds 1337, 2024, and 2025.
Technical details
11L transformer (3 shared x 3 loops + 2 unique, EBLS), 512d, 8 heads / 4 KV heads (GQA), complementary n-gram training (alpha=0.5), 15-order recursive Bayesian backoff with concentration=5.0, int6 GPTQ + LZMA compression. ~14.9 MB artifact.
Feedback welcome.